arXiv:1501.07668v1 [cond-mat.stat-mech] 30 Jan 2015


Sloppiness and Emergent Theories in Physics, Biology, and Beyond 

Mark K. Transtrum,1 Benjamin Machta,2 Kevin Brown,3,4 Bryan C. Daniels,5 Christopher R. Myers,6,7 and
James Sethna6

1) Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, USA

2) Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA

3) Departments of Biomedical Engineering, Physics, Chemical and Biomolecular Engineering, and Marine Sciences,
University of Connecticut, Storrs, CT, USA

4) Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA

5) Center for Complexity and Collective Computation, Wisconsin Institute for Discovery, University of Wisconsin,
Madison, WI, USA

6) Laboratory of Atomic and Solid State Physics, Cornell University, Ithaca, NY, USA

7) Institute of Biotechnology, Cornell University, Ithaca, NY, USA

Large scale models of physical phenomena demand the development of new statistical and computational 
tools in order to be effective. Many such models are ‘sloppy’, i.e., exhibit behavior controlled by a relatively 
small number of parameter combinations. We review an information theoretic framework for analyzing sloppy 
models. This formalism is based on the Fisher Information Matrix, which we interpret as a Riemannian metric 
on a parameterized space of models. Distance in this space is a measure of how distinguishable two models 
are based on their predictions. Sloppy model manifolds are bounded with a hierarchy of widths and extrinsic 
curvatures. We show how the manifold boundary approximation can extract the simple, hidden theory 
from complicated sloppy models. We argue that the success of simple effective models in physics likewise
emerges from complicated processes exhibiting a low effective dimensionality. We discuss the ramifications
and consequences of sloppy models for biochemistry and science more generally. We suggest that our
complex world is understandable for the same fundamental reason: simple theories of macroscopic
behavior are hidden inside complicated microscopic processes.


I. PARAMETER INDETERMINACY AND SLOPPINESS 

As a young physicist, Freeman Dyson paid a visit to
Enrico Fermi [1] (recounted in Ditley, Mayer, and Loew).
Dyson wanted to tell Fermi about a set of calculations 
that he was quite excited about. Fermi asked Dyson 
how many parameters needed to be tuned in the theory 
to match experimental data. When Dyson replied there 
were four, Fermi shared with Dyson a favorite adage of 
his that he had learned from Von Neumann: “with four 
parameters I can fit an elephant, and with five I can make 
him wiggle his trunk.” Dejected, Dyson took the next bus 
back to Ithaca. 

As scientists, we are frequently in a similar position 
to Dyson. We are often confronted with a model — a 
heavily parameterized, possibly incomplete or inaccurate 
mathematical representation of nature — rather than a 
theory (e.g., the Navier-Stokes equations) with few to
no free parameters to tune. In recent decades, fueled by
advances in computing capabilities, the size and scope
of mathematical models have exploded. Massive complex
models describing everything from biochemical reaction 
networks to climate to economics are now a centerpiece of 
scientific inquiry. The complexity of these models raises a 
number of challenges and questions, both technical and 
profound, and demands development of new statistical 
and computational tools to effectively use such models. 

Here we review several developments that have occurred
in the domain of sloppy model research. Sloppy is the term
used to describe a class of complex models exhibiting large
parameter uncertainty when fit to data. Sloppy models were
initially characterized in complex biochemical reaction
networks, but were soon afterward found in a much larger
class of phenomena including quantum Monte Carlo, empirical
atomic potentials, particle accelerator design, insect
flight, and critical phenomena.

As a prototypical example, consider fitting decay data
to a sum of exponentials with unknown decay rates:

$$ y(t, \theta) = \sum_\mu e^{-\theta_\mu t}. \tag{1} $$

We denote the vector of unknown parameters by $\theta$. These
parameters are to be inferred from data, for example, by
nonlinear least squares. This inference problem is
notoriously difficult [10]. Intuitively, we can understand why
by noting that the effect of each individual parameter is
obscured by our choice to observe only the sum. Parameters
have compensatory effects relative to the system’s
collective behavior. A single decay rate can be decreased,
for example, provided other rates are appropriately
increased to compensate.
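
The compensation described above can be checked numerically. The following is a minimal sketch, assuming a two-exponential model with illustrative rates and an illustrative time grid (none of these numbers come from the text): it compares how much the summed signal changes when one rate is lowered alone versus when the other rate is raised to compensate.

```python
import math

def model(t, rates):
    """Sum of decaying exponentials, y(t) = sum_mu exp(-rate_mu * t)."""
    return sum(math.exp(-r * t) for r in rates)

# Illustrative time grid: t = 0.0, 0.1, ..., 5.0
times = [0.1 * i for i in range(51)]

def max_change(rates_a, rates_b):
    """Largest pointwise difference between two parameter sets on the grid."""
    return max(abs(model(t, rates_a) - model(t, rates_b)) for t in times)

reference = [1.0, 1.0]                              # illustrative starting rates
single_shift = max_change(reference, [0.9, 1.0])    # lower one rate only
compensated = max_change(reference, [0.9, 1.1])     # lower one rate, raise the other

print(f"one rate changed:   {single_shift:.4f}")
print(f"compensated change: {compensated:.4f}")
```

In this toy setting the compensated change moves the observed sum several times less than the uncompensated one, which is the sense in which individual rates are poorly determined.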

This uncertainty can be quantified using statistical
methods, as we detail in section II. In particular, the
Fisher Information Matrix (FIM) can be used to estimate
the uncertainty in each parameter in our model. The result
for the sum of exponentials is that each parameter
is almost completely undetermined. Any parameter can
be varied by an infinite amount and the model could still
fit the data. This does not mean that all parameters
can be varied independently of the others. Indeed, while
the statistical uncertainty in each individual parameter
might be infinite, the data place constraints on
combinations of the parameters.

FIG. 1. Sloppy eigenvalue spectra of various multiparameter
models. Eigenvalues of the FIM, indicating sensitivity to
perturbations along orthogonal directions in parameter space,
are roughly evenly spaced in log-space, extending over many
orders of magnitude.

The eigenvalues of the FIM tell us which parameter
combinations are well-constrained by the data and which
are not. Most of the FIM eigenvalues are very small,
corresponding to combinations of parameters that have little
effect on model behavior. These unimportant parameter
combinations are designated sloppy. A small number
of eigenvalues are relatively large, revealing the few
parameter combinations that are important to the model
(known as stiff). It is generally observed that the FIM
eigenvalues decay roughly log-linearly, with each parameter
combination being less important than the previous
by a fixed factor, as in Figure 1. Consequently there is
not a well-defined boundary between the stiff and sloppy
combinations, and four parameters really can “fit the
elephant”.
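
A spectrum of this kind can be reproduced for the sum-of-exponentials example. The sketch below is illustrative only: it assumes six decay rates and a sampling grid chosen by hand, uses log-parameters, and uses the least-squares form of the FIM, $J^T J$, that section II introduces. The eigenvalues are obtained as squared singular values of the Jacobian.

```python
import numpy as np

def jacobian_log_params(times, rates):
    """Jacobian of y(t) = sum_mu exp(-rate_mu * t) with respect to log(rate_mu)."""
    t = times[:, None]            # shape (M, 1)
    r = rates[None, :]            # shape (1, N)
    return -r * t * np.exp(-r * t)

times = np.linspace(0.0, 5.0, 51)     # assumed sampling times
rates = np.geomspace(0.5, 4.0, 6)     # assumed decay rates

J = jacobian_log_params(times, rates)

# Eigenvalues of J^T J equal the squared singular values of J. The SVD is
# numerically safer than forming J^T J when the spectrum is very wide.
fim_eigenvalues = np.linalg.svd(J, compute_uv=False) ** 2

for k, lam in enumerate(fim_eigenvalues):
    print(f"eigenvalue {k}: {lam:.3e}")
print("stiffest / sloppiest:", fim_eigenvalues[0] / fim_eigenvalues[-1])
```

The printed values are sorted from stiffest to sloppiest; the ratio between the first and last gives a rough measure of how many orders of magnitude the spectrum spans for this particular choice of rates and times.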

The degree of parameter indeterminacy in the simple
sum-of-exponentials model has been seen in many complex
models of real life systems for many of the same
reasons. The FIMs for seventeen systems biology models
have been shown to have the same characteristic eigenvalue
structure, and examples from other scientific domains
abound. In each case, observations measure a
system’s collective behavior, and this means that when
parameters have compensatory effects they cannot be
individually identified.

The ubiquity of sloppiness would seem to limit the
usefulness of complex parameterized models. If we cannot
accurately know parameter values, how can a model be
predictive? Surprisingly, predictions are possible without
precise parameter knowledge. As long as the model
predictions depend on the same stiff parameter combinations
as the data, the predictions of the model will
be constrained in spite of large numbers of poorly
determined parameters.
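
This claim can be illustrated with linearized error propagation. The sketch below is a toy calculation under stated assumptions: a six-exponential model in log-parameters, unit Gaussian noise on each data point, hand-picked rates and times, and the standard linear approximation in which the parameter covariance is the inverse of $J^T J$. It compares the resulting parameter uncertainties with the uncertainty of a prediction at an unobserved time inside the data range.

```python
import numpy as np

def jac(times, rates):
    """d y / d log(rate_mu) for y(t) = sum_mu exp(-rate_mu * t)."""
    t = np.atleast_1d(times)[:, None]
    r = rates[None, :]
    return -r * t * np.exp(-r * t)

rates = np.geomspace(0.5, 4.0, 6)      # assumed best-fit rates
t_data = np.linspace(0.0, 5.0, 51)     # assumed observation times (noise sigma = 1)
t_new = 2.05                           # unobserved time inside the data range

J = jac(t_data, rates)
U, s, Vt = np.linalg.svd(J, full_matrices=False)

# Linearized covariance of the log-parameters: (J^T J)^{-1} = V diag(1/s^2) V^T.
# Its diagonal gives the variance of each individual log-parameter.
param_std = np.sqrt(np.sum((Vt.T / s) ** 2, axis=1))

# Propagate the same covariance to the prediction y(t_new).
j_new = jac(t_new, rates)[0]
pred_std = np.linalg.norm((Vt @ j_new) / s)

print("largest parameter std (log units):", param_std.max())
print("prediction std at t_new:          ", pred_std)
```

In this sketch the prediction inherits only the well-constrained (stiff) combinations, so its uncertainty stays small even though individual parameters are very poorly determined.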

The existence of a few stiff parameter combinations can
be understood as a type of low effective dimensionality of
the model. In section III we make this idea quantitative
by considering a geometric interpretation of statistics.
This leads naturally to a new method of model reduction
that constructs low-dimensional approximations to
high-dimensional models (section IV). These low-dimensional
approximations are useful for revealing the emergent
control mechanisms that govern the system’s behavior, i.e.,
extracting a simple emergent theory of the collective
behavior from the larger, complex model.

Simple approximations to complex processes are common
in physics (section V). The ubiquity of sloppiness
suggests that similarly simple models can be constructed
for other complex systems. Indeed, sloppiness has a number
of profound implications for the unreasonable
effectiveness of mathematics and the hierarchical structure
of scientific theories. We discuss some of these
consequences specifically for modeling biochemical networks
in section VI. We discuss more generally the implications
of sloppiness for mathematical modeling in section VII.
We argue that sloppiness is the underlying reason why
the universe (a complete description of which would be
indescribably complex) is comprehensible.


II. MATHEMATICAL FRAMEWORK 

In this section we use information theory to define key
measures of sloppiness geometrically [15]. We first consider
the special case of model selection for models fit to data
by least squares. We then generalize to the case of
arbitrary probabilistic models. The key insight is that the
Fisher Information defines a Riemannian geometry on the
space of possible models. The geometric picture allows
us to show (in section III) that this local sloppy
structure in the metric is paralleled by a global hyper-ribbon
structure of the entire space of possible models.

We begin with a simple case - a model $y$ predicting
data $d$ at experimental conditions $u$, with independent
Gaussian errors; each of these is a vector whose length
$M$ is given by the number of data points. Our model
depends on $N$ parameters $\theta$. In general, an arbitrary model
is a mathematical mapping from a parameter space into
predictions, so interpreting a model as a manifold of
dimension $N$ embedded in a data space $\mathbb{R}^M$ is natural; the
parameters $\theta$ then become the coordinates for the
manifold. If our error bars are independent and Gaussian, all
with the same width (say, $\sigma = 1$), finding the best fit of
model to data is a least squares data fitting problem, as
we illustrate in Figure 2. In this case, we assume that
each experimental data point, $d_i$, is generated from a
parameterized model, $y(u_i, \theta)$, plus random Gaussian noise,
$\xi_i$:

$$ d_i = y(u_i, \theta) + \xi_i. \tag{2} $$

Since the noise is Gaussian,

$$ P(\xi) \propto e^{-\xi^2/2}, \tag{3} $$

maximizing the log likelihood is equivalent to minimizing
the sum of squared residuals, sometimes referred to as the
cost or $\chi^2$ function:

$$ \chi^2(\theta) = \sum_i r_i^2 = \sum_i \left( d_i - y(u_i, \theta) \right)^2. \tag{4} $$

FIG. 2. The Model Manifold. A simple model of
an enzyme-catalyzed reaction can be expressed as a rational
function in substrate concentration ($u$) with four parameters
($\theta$) predicting the reaction velocity ($y$) (inset, top). By
varying $\theta$ the model can predict a variety of behaviors $y$ as a
function of $u$ (top). The model manifold is constructed by
collecting all possible predictions of the model at specific
values of $u$ (red vertical lines at $u = 0.1, 2.0, 4.0$). To visualize
the manifold, we take a two-dimensional cross section of the
four-dimensional manifold by choosing $\theta_1$ and $\theta_2$ to best fit the
experimental data. Varying $\theta_3$ and $\theta_4$ then maps out a
two-dimensional surface of possible values in three-dimensional
data space (bottom). Each curve in the top figure corresponds
to a point of the same color on the model manifold (bottom);
the red crosses on top are data corresponding to the red dot
below.
A sum of squares is reminiscent of a Euclidean distance.
Fitting a model to data by least squares is therefore
minimizing a distance in data space between the observed
data and the model. Distance in data space measures
the quality of a fit to experimental data (red point
in Figure 2). Distance on the manifold is induced by, i.e.,
is the same as, the corresponding distance in data space
and is measured using the metric tensor

$$ g_{\mu\nu} = \sum_i \frac{\partial y(u_i, \theta)}{\partial \theta^\mu} \frac{\partial y(u_i, \theta)}{\partial \theta^\nu} = (J^T J)_{\mu\nu}, \tag{5} $$

where $J_{i\mu} = \partial y(u_i, \theta)/\partial \theta^\mu$ is the Jacobian matrix of
partial derivatives. This metric tensor is precisely the Fisher
Information Matrix (FIM) defined below, specialized to
our least-squares problem. It is, up to a factor of two, the
least squares Hessian matrix of second derivatives of $\chi^2$
from Eq. (4), evaluated where the data point $d$ is taken to
be perfectly predicted by $y(\theta)$. On the manifold, distance is
a measure of identifiability - how difficult it would be to
distinguish nearby points on the manifold (i.e., alternate
models) through their predictions.
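
The statement that the metric $J^T J$ measures data-space distance between nearby models can be checked directly. The sketch below uses an assumed two-exponential model with hand-picked times, parameters, and displacement (all illustrative): it compares the squared Euclidean distance between two nearby prediction vectors with the quadratic form built from the metric.

```python
import numpy as np

def predictions(theta, times):
    """y(t_i, theta) = sum_mu exp(-theta_mu * t_i)."""
    return np.exp(-np.outer(times, theta)).sum(axis=1)

def jacobian(theta, times):
    """J[i, mu] = d y(t_i, theta) / d theta_mu."""
    return -times[:, None] * np.exp(-np.outer(times, theta))

times = np.array([0.5, 1.0, 2.0, 3.0])   # assumed experimental conditions
theta = np.array([1.0, 2.0])             # assumed parameter values
dtheta = np.array([1e-4, -2e-4])         # small parameter displacement

J = jacobian(theta, times)
g = J.T @ J                              # least-squares metric tensor, J^T J

# Squared distance in data space between the two nearby models...
dist2_data = np.sum((predictions(theta + dtheta, times) - predictions(theta, times)) ** 2)
# ...and the same quantity measured with the metric on the manifold.
dist2_metric = dtheta @ g @ dtheta

print(dist2_data, dist2_metric)
```

The two numbers agree to leading order in the displacement; the residual difference comes from curvature terms that the metric alone does not capture.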

We can generalize from this least-squares fitting problem
to encompass other models (like the Ising model)
where the predictions are for entire probability
distributions. For the purpose of modeling, the output of our
model is a probability distribution for $x$, the outcome of
an experiment. A parameterized space of models is thus
defined by $P(x|\theta)$. To define a geometry on this space
we must define a measure of how distinct two points $\theta_1$
and $\theta_2$ in parameter space are, based on their predictions.
Imagine getting a sequence of assumed independent data
$x_1, x_2, \ldots$ with the task of inferring the model which
produced them. The likelihood that model $\theta$ would have
produced this data is given by

$$ P(x_1, x_2, \ldots | \theta) = \prod_i P(x_i|\theta) = \exp\left[ \sum_i \log P(x_i|\theta) \right]. \tag{6} $$

In maximum likelihood estimation our goal is simply to
find the parameter set $\theta$ which maximizes this likelihood.
It is useful to talk about $\log P(x|\theta)$, the log-likelihood, as
this is the unique measure which is additive for
independent data points. The familiar Shannon entropy of a
model’s predictions $x$ is given by minus the expectation
value of the log-likelihood:

$$ S(\theta) = -\sum_x P(x|\theta) \log P(x|\theta). \tag{7} $$

We can also define an analogous quantity that measures
the likelihood that model $\theta_2$ would produce typical data
from $\theta_1$:

$$ \sum_x P(x|\theta_1) \log P(x|\theta_2). \tag{8} $$

The Kullback-Leibler divergence between $\theta_1$ and $\theta_2$
measures how much more likely $\theta_1$ is to produce typical data
from $\theta_1$ than $\theta_2$ would be:

$$ D_{KL}(\theta_1 \| \theta_2) = \sum_x P(x|\theta_1) \left( \log P(x|\theta_1) - \log P(x|\theta_2) \right). \tag{9} $$

Thus $D_{KL}$ is an intrinsic measure of how difficult
distinguishing these two models will be from their data.
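
For a discrete outcome space the Kullback-Leibler divergence is a short sum. The sketch below is a minimal illustration with two made-up two-outcome distributions (the probabilities are assumptions chosen for the example); it evaluates the divergence in both directions.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * (log p(x) - log q(x)) for discrete distributions."""
    return sum(px * (math.log(px) - math.log(qx)) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]     # illustrative model 1
q = [0.9, 0.1]     # illustrative model 2

forward = kl_divergence(p, q)
backward = kl_divergence(q, p)

print("D_KL(p || q) =", forward)
print("D_KL(q || p) =", backward)
```

The two directions give different values, so the divergence depends on which model is taken as the source of the typical data.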

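The KL divergence does not satisfy the mathematical
requirements of a distance measure. It is asymmetric,
and does not satisfy even a weak triangle inequality:
in some cases $D_{KL}(\theta_1 \| \theta_3) > D_{KL}(\theta_1 \| \theta_2) + D_{KL}(\theta_2 \| \theta_3)$.
However, for models whose parameters $\theta$ and $\theta + \delta\theta$ are
quite close to one another, the leading terms are
symmetric and can be written as:

$$ D_{KL}(\theta \| \theta + \delta\theta) = \tfrac{1}{2}\, g_{\mu\nu}\, \delta\theta^\mu \delta\theta^\nu + O(\delta\theta^3), \tag{10} $$

where $g_{\mu\nu}$ is the Fisher Information Matrix (FIM), which
can be written:

$$ g_{\mu\nu}(P_\theta) = -\sum_x P_\theta(x) \frac{\partial}{\partial \theta^\mu} \frac{\partial}{\partial \theta^\nu} \log P_\theta(x). \tag{11} $$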

The FIM has all the properties of a metric tensor. It is
symmetric and positive semi-definite (because no model
can on average be better described by a different model)
and it transforms properly under a coordinate
reparameterization of $\theta$. Information Geometry is the study
of the properties of the model manifold defined by this
metric. In particular, it defines a space of models in a way
that does not depend on the labels given to the parameters,
which are presumably arbitrary; should one measure
rate constants in seconds or hours, and more problematically,
should one label these constants as rates, or time
constants? Information Geometry makes clear that some
aspects of a parameterized model can be defined in ways
that are invariant to these arbitrary choices.
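
The link between the KL divergence and the FIM can be verified on the simplest possible model. The sketch below assumes a Bernoulli model with success probability $p$ as its single parameter (an illustrative choice, not one used in the text). For this model the FIM, defined as minus the expected second derivative of the log-likelihood, is $1/(p(1-p))$, and with that normalization the leading term of the divergence between nearby models is one half of the quadratic form.

```python
import math

def kl_bernoulli(p, q):
    """D_KL between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher_bernoulli(p):
    """Fisher information of Bernoulli(p): minus the expected second
    derivative of log P(x|p) with respect to p."""
    return 1.0 / (p * (1.0 - p))

p, dp = 0.3, 1e-3                       # illustrative parameter and displacement
exact = kl_bernoulli(p, p + dp)
quadratic = 0.5 * fisher_bernoulli(p) * dp ** 2

print("exact KL divergence:     ", exact)
print("(1/2) * g * dp^2:        ", quadratic)
print("relative difference:     ", abs(exact - quadratic) / quadratic)
```

The relative difference shrinks in proportion to the displacement, as expected for a cubic correction term.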

III. WHY SLOPPINESS? INFORMATION GEOMETRY 

Sloppy models can be identified by the characteristic
eigenvalue spectrum of the FIM. We interpret the
existence of many small eigenvalues in the FIM to be
representative of a complex model with a low effective
dimensionality. Many combinations of parameters have
minimal effect on the behavior of the model, while the key
features of model behavior are controlled by a relatively
small number of stiff parameter combinations. In a sense,
then, there really is a simpler ‘theory’ embedded in the
multiparameter ‘model’.

In this section we make the notion of low effective
dimensionality explicit. We will see that although this
interpretation of sloppy models turns out to be correct,
the eigenvalues of the FIM are not sufficient to make this
conclusion. Instead, we use the geometric interpretation
of modeling introduced in section II that allows us to
quantify important features of the model in a global and
parameterization independent way. The effort to develop
this formalism will pay further dividends when we
consider model reduction in section IV.

To understand the limitations of interpreting the
eigenvalues of the FIM, we return to the question of model
reparameterization. Something as trivial as changing the
units of a rate constant from Hz to kHz changes the
corresponding row and column of the FIM by a factor of 1000,
in turn changing the eigenvalues. Of course, none of the
model predictions are altered by such a change since a
correcting factor of 1000 will be introduced throughout
the model. More generally, the FIM can be transformed
into any positive definite matrix by a simple linear
transformation of parameters while model predictions are
always invariant to such a reparameterization.
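
The unit-change argument can be made concrete with a small matrix. The sketch below uses a made-up 2x2 FIM (the numbers are assumptions for illustration) in which the first parameter is a rate measured in Hz. Re-expressing that rate in kHz rescales one row and column by 1000 and changes the eigenvalues, while the distance assigned to a given physical parameter change is unchanged.

```python
import numpy as np

# Hypothetical FIM; parameter 0 is a rate in Hz, parameter 1 is dimensionless.
g_hz = np.array([[2.0e-6, 1.0e-3],
                 [1.0e-3, 3.0]])

# theta_Hz = S @ theta_kHz, so the Jacobian picks up a factor S and g -> S^T g S.
S = np.diag([1000.0, 1.0])
g_khz = S.T @ g_hz @ S

eig_hz = np.linalg.eigvalsh(g_hz)
eig_khz = np.linalg.eigvalsh(g_khz)

# The same physical displacement expressed in both unit systems.
dtheta_hz = np.array([50.0, 0.2])
dtheta_khz = np.linalg.solve(S, dtheta_hz)

ds2_hz = dtheta_hz @ g_hz @ dtheta_hz
ds2_khz = dtheta_khz @ g_khz @ dtheta_khz

print("eigenvalues (Hz): ", eig_hz)
print("eigenvalues (kHz):", eig_khz)
print("squared distances:", ds2_hz, ds2_khz)
```

Only the quadratic form, not the eigenvalue spectrum, is a reparameterization-independent statement about the model.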

Although the FIM eigenvalues are not invariant to
reparameterization, we can use information geometry to
search for a parameterization independent measure of
sloppiness. With the definitions of section II,
computational differential geometry can be used to explore
a wide variety of model manifolds in a parameter
independent way. A review of these methods is beyond the
scope of this paper, and we refer the interested reader to
reference [11] or any standard text on differential
geometry.

The key geometric feature of the model manifolds of
nonlinear sloppy systems is that they have boundaries.
Many parameters and parameter combinations can be
taken to extreme values (zero or infinity) without leading
to infinite predictions. These boundaries can be explored
by numerically constructing manifold geodesics: analogs
of straight lines on curved surfaces. The arc lengths of
geodesics are a measure of the width of the model
manifold in each direction. Measuring these arc lengths for
a sloppy model shows that the widths of sloppy model
manifolds are exponentially distributed, reminiscent of
the exponential distribution of FIM eigenvalues. Indeed,
when we use dimensionless model parameters (e.g.,
log-parameters), the square roots of the FIM eigenvalues are
a reliable approximation to the widths of the manifold in
the corresponding eigendirections [11].

The exponential distribution of manifold widths has
been described as a hyperribbon (Fig. 3). A
three-dimensional ribbon has a long dimension, a broad
dimension, and a very thin dimension. The observed
hierarchy of exponentially decreasing manifold widths is
a high-dimensional generalization of this structure. We
will explore the nature of these boundaries in more detail
when we discuss model reduction in section IV.

The observed hierarchy of widths can be demonstrated
analytically for the case of a single independent variable
(such as time or substrate concentration in Figure 2) by
appealing to theorems for the convergence of interpolating
functions (Fig. 3(a)). Consider removing a few degrees
of freedom from a time series by fixing the output of
the model at a few time points. The resulting model
manifold corresponds to a cross-section of the original. Next,
consider how much the predictions at intermediate time
points can be made to vary as the remaining parameters
are scanned. As more and more predictions are fixed
(i.e., considering higher-dimensional cross sections of the
model manifold), we intuitively predict that the behavior
of the model at intermediate time points will become
more constrained. Interpolation theorems make this
intuition formal; presuming smoothness or analyticity of
the predictions as a function of time, one can demonstrate
an exponential hierarchy of widths consistent with
the hyperribbon structure observed empirically [11, 24].

FIG. 3. Hyperribbon. (a) Given a multiparameter model
for one-dimensional data $f(t)$ at different times $t$, the model
manifold has a different dimension for every time $t$.
Specifying a data point $f(t_n)$ thus gives a cross-section of the
model manifold, and also reduces the uncertainty in the values
of neighboring points - hence giving the cross-section a
narrower width - a hyperribbon. Interpolation theory can
be used to quantify this qualitative argument. (Figure from
Fig. 5 of the cited reference.) (b,c) Two views of a
hyperribbon cross section of a model manifold. The model is
decaying exponentials fit to radioactive decay data. Notice
the ribbon-like structure of this three-dimensional
projection: long, narrow, and very thin.

The exponential hierarchy of manifold widths makes
explicit the notion of a low effective dimensionality in
the model that was hinted at by the eigenvalues of the
FIM. It also helps illustrate how models can be predictive
without parameters being tightly constrained. Only
those parameter combinations that are required to fit
the key features of the data need to be estimated
accurately. The remaining parameter combinations
(controlling for example the high-frequency behavior in our
time series example) are unnecessary. In short, the model
essentially functions as an interpolation scheme among
observed data points. Models are predictive with
unconstrained parameters when the predictions interpolate
among observed data.

Understanding models as generalized interpolation
schemes makes additional predictions about the generic
structure of sloppy model manifolds. Not only is there
an exponential distribution of widths, there is also an
exponential distribution of extrinsic curvatures.
Furthermore, these curvatures are relatively small in relation to
the widths, making the model manifold surprisingly flat.
Most of the nonlinearity of the model’s parameters takes
the form of ‘parameter effects curvature’ [19, 27]
(equivalent to the connection coefficients). The small extrinsic
curvature of many models was a mystery first noted in
the early 1980s [19] that is explained by interpolation
arguments.


IV. MODEL REDUCTION 


In this section, we leverage the power of the information
geometry formalism to answer the question: how can
a simple effective model be constructed from a
(more-or-less) complete but sloppy representation of a physical
system? Our goal is to construct a physically meaningful
representation that reveals the simple ‘theory’ that is
hidden in the model.

The model reduction problem has a long history, and
it would be impossible in this review to even approach a
comprehensive survey of literature on the subject. Several
standard methods have emerged that have proven
effective in appropriate contexts. Examples include
clustering components into modules, mean field theory,
various limiting approximations (e.g., continuum,
thermodynamic, or singular limits), and the renormalization
group. Considerable effort has been devoted
by the control and engineering communities to approximate
large-scale dynamical systems, leading to the
method of balanced truncation, including several
structure preserving variations and generalizations
to nonlinear cases. Methods for inferring minimal
dynamical models in cases for which the underlying structure
is not known are also beginning to be developed [49, 50].

Unfortunately, many automatic methods produce
‘black box’ approximations. For most scenarios of
practical importance, a reduced representation alone has
limited utility since attempts to engineer or control the
system typically operate on the microscopic level. For
example, mutations operate on individual genes and drugs
target specific proteins. A method that explicitly reveals
how microscopic components are ‘compressed’ into a few
effective degrees of freedom would be very useful. On
the other hand, methods that do explicitly connect
microscopic components to macroscopic behaviors have
limited scope since they often exploit special properties of
the model’s functional form, such as symmetries.
Consider, for example, the renormalization group, which
operates on field theories with a scale invariance or
conformal symmetry. Simplifying modular network systems,
such as biochemical networks, is particularly challenging
due to inhomogeneity and lack of symmetries.

The Manifold Boundary Approximation Method
(MBAM) is an approach to model approximation
whose goal is to alleviate these challenges. As the
name implies, the basic idea is to approximate a
high-dimensional, but thin model manifold by its boundary.
The procedure can be summarized as a four-step
algorithm. First, the least sensitive parameter combination
is identified from an eigenvalue decomposition of the
FIM. Second, a geodesic on the model manifold is
constructed numerically to identify the boundary. Third,
having found the edge of the model manifold, the
corresponding model is identified as an approximation to
the original model. Fourth, the parameter values for this
approximate model are calibrated by fitting the
approximate model to the behavior of the original model.

The result of this procedure is an approximate model
that has one fewer parameter and that is marginally less
sloppy than the original. Iterating the MBAM algorithm
therefore produces a series of models of decreasing
complexity that explicitly connect the microscopic
components to the macroscopic behavior. These models
correspond to hyper-corners of the original model manifold.
The method requires only that the model manifold have
a hierarchy of boundaries while making no assumptions
about the mathematical form of the model or underlying
physics of the system. As such, MBAM is a very general
approximation scheme.
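
The first two steps of the procedure can be sketched on a toy problem. The code below is not the authors' implementation; it is a minimal illustration under stated assumptions: a two-exponential model in log-parameters, hand-picked observation times and starting point, a finite-difference second directional derivative, and a plain fourth-order Runge-Kutta integrator taken only a short distance along the geodesic rather than all the way to the boundary. It uses the standard geodesic equation for a least-squares manifold, in which the parameter-space acceleration is minus the pseudo-inverse of the Jacobian applied to the second directional derivative of the predictions.

```python
import numpy as np

times = np.linspace(0.2, 3.0, 15)        # assumed observation times

def predictions(phi):
    """y(t) = exp(-theta_1 t) + exp(-theta_2 t), with phi = log(theta)."""
    return np.exp(-np.outer(times, np.exp(phi))).sum(axis=1)

def jacobian(phi):
    """J[i, mu] = d y(t_i) / d phi_mu."""
    theta = np.exp(phi)
    return -times[:, None] * theta[None, :] * np.exp(-np.outer(times, theta))

def geodesic_rhs(phi, vel, h=1e-4):
    """Return (d phi / d tau, d vel / d tau) for the geodesic equation."""
    # Second directional derivative of the predictions along vel (finite differences).
    A = (predictions(phi + h * vel) - 2 * predictions(phi) + predictions(phi - h * vel)) / h**2
    # Acceleration = -J^+ A, computed as a least-squares solve.
    acc = -np.linalg.lstsq(jacobian(phi), A, rcond=None)[0]
    return vel, acc

def rk4_step(phi, vel, dtau):
    k1p, k1v = geodesic_rhs(phi, vel)
    k2p, k2v = geodesic_rhs(phi + 0.5 * dtau * k1p, vel + 0.5 * dtau * k1v)
    k3p, k3v = geodesic_rhs(phi + 0.5 * dtau * k2p, vel + 0.5 * dtau * k2v)
    k4p, k4v = geodesic_rhs(phi + dtau * k3p, vel + dtau * k3v)
    phi_new = phi + dtau * (k1p + 2 * k2p + 2 * k3p + k4p) / 6
    vel_new = vel + dtau * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return phi_new, vel_new

# Step 1: the least sensitive direction is the FIM eigenvector with the
# smallest eigenvalue (eigh returns eigenvalues in ascending order).
phi = np.log(np.array([1.0, 3.0]))       # assumed starting parameters
J0 = jacobian(phi)
eigvals, eigvecs = np.linalg.eigh(J0.T @ J0)
vel = eigvecs[:, 0]

# Step 2 (partial): follow the geodesic launched along that direction.
speed_start = np.linalg.norm(J0 @ vel)
for _ in range(30):
    phi, vel = rk4_step(phi, vel, dtau=0.01)
speed_end = np.linalg.norm(jacobian(phi) @ vel)

print("data-space speed at start and end:", speed_start, speed_end)
print("log-parameters after integration: ", phi)
```

A geodesic moves at constant speed in data space, so comparing the two printed speeds is a simple consistency check on the integrator. A full implementation would continue until the FIM becomes singular and then carry out steps three and four.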

The key components that enable MBAM are the edges
of the model manifold. The existence of these edges was
first noted in the context of data fitting and MCMC
sampling of Bayesian posterior distributions [7]. It was
noted that algorithms would often ‘evaporate’ parameters,
i.e., allow them to drift to extreme, usually infinite,
values. These extreme parameter values correspond to
limiting behaviors in the model, i.e., manifold
boundaries.

‘Evaporated parameters’ are especially problematic for
numerical algorithms. Numerical methods often push
parameters to the edge of the manifold and then become lost
in parameter space. Consider the case of MCMC
sampling of a Bayesian posterior. If a parameter drifts to
infinity, there is an infinite amount of entropy associated
with that region of parameter space and the sampling will
never converge. Furthermore, the model behavior of such
a region will always dominate the posterior distribution [7].

For data fitting algorithms, methods such as
Levenberg-Marquardt operate by fitting the key features
of the data first (i.e., the stiffest directions), followed
by successive refining approximations (i.e., progressively
more sloppy components). While fitting the initial key
features, algorithms often evaporate those parameters
associated with less prominent features of the data. The
algorithm is then unable to bring the parameters away from
infinity in order to further refine the fit.

Although problematic for numerical algorithms,
manifold edges are useful for both approximating (à la MBAM)
and interpreting complex models. To illustrate, we
consider an EGFR signaling model. Figure 4 illustrates
components of one eigenparameter, corresponding in this
case to the smallest eigenvalue of the FIM. Notice that
the eigenparameters do not align with bare parameters of
the model, but typically involve an unintuitive
combination of bare parameters. However, by following a geodesic
along the model manifold to the manifold edge (step 2
of the MBAM algorithm), these complex combinations
slowly rotate to reveal relatively simple, interpretable
combinations that correspond to a limiting approximation
of the model. For example, the EGFR model
consists of a network of Michaelis-Menten reactions.
The boundary revealed [51] in Figure 4 corresponds
to the limit of a reaction rate and a Michaelis-Menten
constant becoming infinite while their ratio is finite:

$$ \frac{d}{dt}[B] = \frac{k\,[A][B]}{K_M + [B]} + \cdots \tag{12} $$

$$ \rightarrow \frac{k}{K_M}\,[A][B] + \cdots, \tag{13} $$

where $[A]$ and $[B]$ are concentrations of two enzymes in
the model and the ratio $k/K_M$ is the renormalized
parameter in the approximate model.

FIG. 4. Identifying the boundary limit. The eigenvector
of the FIM with the smallest eigenvalue (components plotted
against parameter index) is often a complicated combination
of bare parameters that is difficult to either
interpret or remove from the model (top left). By following a
geodesic to the manifold boundary, the combination rotates to
reveal a limiting behavior (bottom left); here two parameters
(a reaction rate and a Michaelis-Menten constant) become
infinite. The limiting behavior is revealed when the smallest
eigenvalue has become separated from the other eigenvalues
(right).
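
The limit of a rate and a Michaelis-Menten constant going to infinity at fixed ratio can be checked with a few lines of arithmetic. The sketch below assumes the standard Michaelis-Menten rate form with $[B]$ in the denominator and uses made-up concentrations and a made-up value of $k/K_M$; it shows the full rate approaching the bilinear boundary rate as $K_M$ grows.

```python
def michaelis_menten_rate(k, K_M, A, B):
    """Rate k [A][B] / (K_M + [B]); the standard form, assumed for this sketch."""
    return k * A * B / (K_M + B)

def limiting_rate(ratio, A, B):
    """Boundary model: (k / K_M) [A][B]."""
    return ratio * A * B

A, B = 0.5, 2.0        # hypothetical concentrations
ratio = 3.0            # fixed value of k / K_M

rel_errors = []
for K_M in (1e0, 1e2, 1e4, 1e6):
    full = michaelis_menten_rate(ratio * K_M, K_M, A, B)
    limit = limiting_rate(ratio, A, B)
    rel_errors.append(abs(full - limit) / limit)
    print(f"K_M = {K_M:8.0e}   relative difference = {rel_errors[-1]:.2e}")
```

The relative difference falls roughly as $[B]/K_M$, so only the ratio $k/K_M$ remains identifiable once $K_M$ is large compared with the substrate concentration.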

Because the manifold edges correspond to models that
are simple approximations of the original, the MBAM
can be used to iteratively construct simple representations
of otherwise complex processes. By combining several
limiting approximations, simple insights into the system
behavior emerge that were obfuscated by the original
model’s complexity. Figure 5 compares network
diagrams for the original and approximate EGFR models.
The original consists of 29 differential equations and 48
parameters, while the approximate consists of 6
differential equations and 12 parameters and is notably not
sloppy.

Because the MBAM process explicitly connects models
through a series of limiting approximations, the
parameters of the reduced model can be identified with
(nonlinear) combinations of parameters in the original model.
For example, one of the twelve parameters in the reduced
model of Fig. 5 is written as an explicit combination of
seven ‘bare’ parameters of the original model:

$$ \phi = \frac{(\mathrm{kRap1ToBRaf})\,(\mathrm{KmdBRaf})\,(\mathrm{kpBRaf})\,(\mathrm{KmdMek})}{(\mathrm{kdBRaf})\,(\mathrm{KmRap1ToBRaf})\,(\mathrm{kdMek})} $$

Expressions such as this explicitly identify which
combinations of microscopic parameters act as emergent
control knobs for the system.

FIG. 5. Original [3] and reduced [51] EGFR models. The
interactions of the EGFR signaling pathway [3] are summarized
in the leftmost network. Solid circles are chemical species for
which the experimental data were available to fit. Manifold
boundaries reduce the model to a form (right) capable of
fitting the same data and making the same predictions as in the
original references [3]. The FIM eigenvalues (center) indicate
that the simplified model has removed the irrelevant
parameters identified as eigenvalues less than 1 (dotted line) while
retaining the original model’s predictive power.

MBAM naturally includes many other approximation
methods as special cases. By an appropriate choice of
parameterization, it is both a natural language for model
reduction and a systematic method that in practice can
be mostly automated.

The MBAM is a semi-global approximation method. 
Manifold boundaries are a non-local feature of the model. 
However, MBAM only explores the region of the manifold 
in the vicinity of a single hyper-corner. More generally, 
it is possible to identify all of the edges of a particular 
model (and by extension, all possible simplified models). 
This analysis is known as information topology.

V. EMERGENCE IN PHYSICS AS SLOPPINESS 

Unlike in systems biology, physics is dominated by ef¬ 
fective models and theories whose forms are often de¬ 
duced long before a microscopic theory is available. This 
is in large part due to the great success of continuum 
limit arguments and Renormalization Group (RG) pro¬ 
cedures in justifying the expectation and deriving the 
form of simple emergent theories. These methods show 
that many different multi-parameter microscopic theories 
typically collapse onto one coarse-grained model, with 
the complex microscopics being summarized into just a
few ‘relevant’ coarse-grained parameters. This explains
why an effective theory, or an oversimplified ‘cartoon’ 
microscopic theory, can often make quantitatively cor¬ 
rect predictions. Thus, while three dimensional liquids 
have enormous microscopic diversity, in a certain regime 
(lengths and times large compared to molecules and their 
vibration periods), their behavior is determined entirely 
by their viscosity and density. Although two different 
liquids can be microscopically completely different, their 
effective behavior is determined only by the projection of 
their microscopic details onto these two control param¬ 
eters. This parameter space compression underlies the 
success of renormalizable and continuum limit models. 

This connection has been made explicit by examining
the FIM for typical microscopic models in physics 9. A
microscopic hopping model for the continuum diffusion 
equation quickly develops ‘stiff’ directions correspond¬ 
ing to the parameters of the continuum theory - the to¬ 
tal number of particles, net mean velocity, and diffusion 
constant. As time evolves, all other microscopic param¬ 
eter combinations become increasingly sloppy - irrele¬ 
vant for prediction of long-time behavior. Similarly, a 
microscopic long-range Ising model for ferromagnetism, 
when observed on long length scales, develops stiff di¬ 
rections along precisely those parameter combinations 
deemed ‘relevant’ under the renormalization group. 

Consider a model of stochastic motion as a stand-in for 
a molecular level description of particles moving through 
a possibly complicated fluid. Such a fluid’s properties 
depend on many parameters such as the bond angle of 
the molecules which make it up, all of which enter into 
the probability distribution for motion within the fluid, 
which can presumably be microscopically complex. However,
the central limit theorem says that as many of these
random steps are added together, the long-time movement
of particles will lead to them being distributed in
space according to a Gaussian. As this happens, diverse
microscopic details must become compressed into
the two parameters of a Gaussian distribution: its mean
and width. As a concrete example, in the top of Figure [6]
two very different microscopic motions are considered. In 
each time step, red particles take a random step from a 
triangular distribution, while blue particles step accord¬ 
ing to a square distribution. While these motions lead 
to very different distributions after a single time step, 
as time proceeds they become indistinguishable precisely 
because their first and second moments are matched. 
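
The loss of microscopic detail described above can be checked numerically. The following is a minimal sketch, not the simulation behind Figure [6]: it assumes a one dimensional lattice instead of two dimensional kernels, and the two kernels `kernel_flat` and `kernel_peaked` are hypothetical choices that share zero drift and the same variance while differing in shape. The n-step distributions are computed exactly by repeated convolution, and their total variation distance is used as the measure of distinguishability.

```python
import numpy as np

# Two hypothetical 1D hopping kernels on lattice sites -2..2 (assumptions).
# Both have zero mean (no drift) and variance 2/3, but different shapes
# and different higher moments.
kernel_flat = np.array([0.0, 1 / 3, 1 / 3, 1 / 3, 0.0])
kernel_peaked = np.array([1 / 24, 1 / 6, 7 / 12, 1 / 6, 1 / 24])


def distribution_after(kernel, n_steps):
    """Exact position distribution after n_steps independent hops."""
    rho = np.array([1.0])  # the particle starts at the origin
    for _ in range(n_steps):
        rho = np.convolve(rho, kernel)
    return rho


def total_variation(kernel_a, kernel_b, n_steps):
    """Total variation distance between the two n-step distributions."""
    rho_a = distribution_after(kernel_a, n_steps)
    rho_b = distribution_after(kernel_b, n_steps)
    return 0.5 * np.abs(rho_a - rho_b).sum()


sites = np.arange(-2, 3)
for kernel in (kernel_flat, kernel_peaked):
    print("mean:", kernel @ sites, "variance:", kernel @ sites**2)

tv_by_step = {t: total_variation(kernel_flat, kernel_peaked, t) for t in (1, 5, 50)}
for t, tv in tv_by_step.items():
    print(f"steps = {t:3d}   total variation distance = {tv:.4f}")
```

The distance shrinks with the number of steps because only the matched first and second moments survive in the Gaussian limit; kernels with different variances would instead remain distinguishable at long times.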

FIG. 6. Microscopic Motion becomes Diffusive. Top: Simulated particles undergo stochastic motion in discrete time.
Red particles hop according to a triangular kernel, while blue particles hop according to a square kernel. After a single time
step, the particles have very different distributions in space, and neither resembles the distribution predicted by the diffusion
equation. However, as time evolves, most of the information about this kernel is lost. Only the particles’ diffusion tensor and
average drift enter into a continuum description. For the particles shown, the drift is 0, and their respective diffusion tensors
are matched, so that the resulting distributions become quantifiably indistinguishable as time proceeds. The compression of
microscopic details mirrors the compression of molecular level detail in the emergence of diffusion as a continuum limit of
motion in real fluids. Bottom: We consider the model manifold of a one dimensional lattice version of this diffusion example.
As with the triangle and square, the red and blue kernels shown on the left have drift 0, and identical second moments, though
higher moments and the distribution in general are very different. The remaining white points making up the manifold are
taken from a uniform distribution in parameter space describing the probability of hopping to one of six nearest neighbors in
a given time step, as in Ref. 9. Here we plot a three dimensional projection of the model manifold taken from measurements
of particle distributions at different time points. After a single time step, this three dimensional projection from data space
shows a ‘hyper-blob’, with changes in parameters leading to a large diversity of observable behaviors. In particular, the red and
blue points are not close to each other even though their drifts are both 0 and their diffusion constants are matched; as with
the square and triangle, their distributions are easily distinguishable after a single time step. However, after several stochastic
steps, the model manifold takes on a hyperribbon structure. Models for which all effective parameters are matched, like the red
and blue kernels highlighted, rapidly move close to each other. At late times, any model sufficiently flexible to capture the two
remaining extended directions is adequate to describe effective behavior, explaining the ubiquity and success of the continuum
diffusion equation.

This indistinguishability can be quantified by considering
the model manifold of possible microscopic models of
stochastic motion, again paralleling real fluids that can
be microscopically diverse. When probed at the intrinsic
time and length scales of these fluids, we should make
few assumptions about the type of motion we expect; in
particular, we should allow for behaviors much more
complicated than diffusion, by analogy with the square and
triangle kernels described in two dimensions above. Following
Ref. 9, we consider a one dimensional ‘molecular level’ model
for stochastic motion, in which parameters describe the
rates at which a particle hops to one of its close-by
neighbors. After a single time step, the corresponding model
manifold is a ‘hyper-blob’ (Fig. [6], bottom) and two
particular models, marked in red and blue, are distinguishable;
they are not nearby on the model manifold. The
prediction space of a model is truly multidimensional in
this regime; it cannot be described by the two parameter
diffusion equation. In this ‘ballistic’ regime, motion is
not described by the diffusion equation, and is presumably
not just different, but genuinely more complicated.
However, as time proceeds, the model manifold contracts
onto a hyper-ribbon, in which just two parameter
combinations distinguish behavior. In this regime, all points lie
close to the two dimensional manifold predicted by the
diffusion equation, and the red and blue points have
become indistinguishable; they are now in close proximity
on the manifold.
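
The contraction from hyper-blob to hyper-ribbon can be illustrated through the FIM of a lattice hopping model. The sketch below rests on several assumptions that are not taken from the text: the parameters are seven hopping weights on sites -3..3, the observations are the particle densities at a single time with unit Gaussian noise (so the FIM is simply $J^T J$), and the expansion point is a uniform kernel. The function name `hopping_fim_eigenvalues` is illustrative only.

```python
import numpy as np


def hopping_fim_eigenvalues(kernel, n_steps):
    """Eigenvalues (descending) of J^T J for the density after n_steps hops.

    The parameters are the hopping weights theta_mu themselves. Because the
    density is the n_steps-fold convolution of the kernel, its derivative
    with respect to theta_mu is n_steps times the (n_steps - 1)-fold
    convolution, shifted by mu lattice sites.
    """
    kernel = np.asarray(kernel, dtype=float)
    n_params = kernel.size
    rho_prev = np.array([1.0])  # density before any hop: all weight at origin
    for _ in range(n_steps - 1):
        rho_prev = np.convolve(rho_prev, kernel)
    jac = np.zeros((rho_prev.size + n_params - 1, n_params))
    for m in range(n_params):
        jac[m:m + rho_prev.size, m] = n_steps * rho_prev
    # Squared singular values of J are the eigenvalues of J^T J; the SVD is
    # numerically safer than forming the ill-conditioned product explicitly.
    singular_values = np.linalg.svd(jac, compute_uv=False)
    return singular_values**2


kernel = np.full(7, 1.0 / 7.0)  # hypothetical kernel: equal weight on 7 sites
spectra = {t: hopping_fim_eigenvalues(kernel, t) for t in (1, 2, 5)}
for t, eig in spectra.items():
    print(f"t = {t}: eigenvalues / largest =", np.array2string(eig / eig[0], precision=2))
```

After one step every parameter is equally measurable, so the normalized spectrum is flat; after a few steps the columns of the Jacobian become nearly parallel shifted copies of a broad density, and the eigenvalues spread over many decades.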


Using information geometry, approximations analo¬ 
gous to continuum limits or the renormalization group 
can be found and used to construct similarly simple the¬ 
ories in fields for which effective theories have historically 
been difficult to construct or justify. 


VI. RAMIFICATIONS OF SLOPPINESS IN BIOCHEMICAL 
MODELING 

In previous sections, we have emphasized picturing the
model manifold in data space, as in Figure [2]; here, thin,
sloppy dimensions of the hyperribbon correspond to behavior
that is minimally dependent on parameters. The
dual picture in parameter space, sketched in Figure [7], is
one in which the set of parameters that fit some given
data sufficiently well is stretched to extend far along sloppy
dimensions. This picture is important to understanding
implications for biochemical modeling with regard to
parameter uncertainty.

[Figure 7 panels: axes are the log-parameters $\theta_1 = \log(p_1)$ and $\theta_2 = \log(p_2)$.]

FIG. 7. Sloppiness in Parameter Space. Left: A schematic of a typical Sloppy Model ensemble, pictured in two dimensions
for clarity. The underlying cost surface (with constant cost contours illustrated as ellipses) is generated by the fit to data.
Eigenvectors of the FIM correspond to principal axes of the ellipse, with widths of the ellipse inversely proportional to the
square roots of the corresponding eigenvalues $\lambda_i$. Points inside the ellipse each represent a set of parameters that fits the
data within a given tolerance (in practice often created using a Monte Carlo approach), forming an ensemble representing
uncertainty about the true values of parameters. Sloppiness can result in good fits to data despite enormous uncertainties in
‘bare’ $\theta$ parameters (dotted lines intersecting the axes). Right: Careful measurements of individual parameters (like $\theta_1$) can
shrink uncertainty, but if even a single parameter remains unknown (like $\theta_2$), large predictive uncertainty can still result.

For instance, using the full EGFR signal transduction 
network (left side of Figure [5]), we may wish to make a
prediction about an unmeasured experimental condition, 
e.g. the time-course of ERK activity upon EGF stimu¬ 
lation. In general, if there are large uncertainties about 
the model’s parameters, we expect our uncertainty about 
this time-course to also be large. 

Indeed this is the case if we neglect effects of compen¬ 
sation among parameters and assume that uncertainties 
in different parameters are uncorrelated. If we view the 
problem of uncertainties in model predictions as com¬ 
ing from a lack of precision measurements of individual 
parameters, we may try to carefully and independently 
measure each parameter. This can work if such mea¬ 
surements are feasible, but can fail if even one relevant 
parameter remains unknown: as in the right plot of Figure [7],
a large uncertainty along the direction of a single
parameter corresponds to large changes in system-level 
behavior, leading to large predictive uncertainties 12 . 

Contrasting with this approach, we can instead con¬ 
strain the model parameters with system-level measure¬ 
ments that are similar to the types of measurements we 
wish to predict. Due to the phenomenon of sloppiness, 
we expect that this approach will produce a subspace of 
acceptable parameters that will include large uncertain¬ 
ties in individual parameter values (left plot of Figure [7]). 
Again, this arises because two parameters that change
the output in a correlated way can be varied simultaneously
without changing the model output. The variance of predicted
system-level measurements over the large sloppy parameter
subspace is often so small that achieving it by measuring
parameters independently would require extremely precise
measurements. Predictions of interest can thus be made
without precisely knowing any single parameter.
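
The contrast between the two approaches can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions, not the EGFR analysis: the model is a hypothetical sum of three exponentials, the noise level `sigma` and the decay rates are arbitrary, and the ensemble is replaced by its Gaussian approximation with covariance equal to the inverse FIM. It compares the predictive uncertainty obtained with the full parameter covariance against the one obtained when parameter uncertainties are wrongly treated as uncorrelated.

```python
import numpy as np

# Hypothetical sloppy model (an assumption): y(t) = sum_i exp(-theta_i * t).
theta = np.array([1.0, 1.3, 1.7])
t_data = np.linspace(0.2, 4.0, 20)  # times of the system-level measurements
sigma = 0.01                        # assumed measurement noise


def jacobian(theta, times):
    """dy(t)/dtheta_i = -t * exp(-theta_i * t), one row per time point."""
    times = np.atleast_1d(times)
    return -times[:, None] * np.exp(-np.outer(times, theta))


jac = jacobian(theta, t_data)
fim = jac.T @ jac / sigma**2
cov = np.linalg.inv(fim)  # Gaussian approximation to the parameter ensemble

param_std = np.sqrt(np.diag(cov))

# Predict y at a time that was not measured.
grad = jacobian(theta, 1.1)[0]
pred_std_correlated = np.sqrt(grad @ cov @ grad)
pred_std_uncorrelated = np.sqrt(np.sum(grad**2 * np.diag(cov)))

print("relative parameter uncertainties:", param_std / theta)
print("prediction std, full covariance:  ", pred_std_correlated)
print("prediction std, correlations dropped:", pred_std_uncorrelated)
```

Keeping the off-diagonal terms of the covariance is what encodes the compensation among parameters; dropping them inflates the predictive uncertainty even though the individual parameter uncertainties are identical in both calculations.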

Thus, in the sense that estimating parameters entails 
discovering their precise values, in sloppy models param¬ 
eter estimation becomes useless. This does not mean that 
anything goes; the region of acceptable parameters may 
be small compared to prior knowledge about their values. 
Yet it does validate a common approach to modeling such 
systems, in which educated guesses are made for most pa¬ 
rameters, and a remaining handful are fit to the data. In 
the common situation in which there are a small number 
m of important ‘stiff’ directions, with remaining sloppy 
directions extending to cover the full range of feasible pa¬ 
rameters, fitting m parameters will be enough to locate 
the sloppy subspace. (And if using a maximum likelihood 
approach, this is in fact statistically preferred to fitting 
all parameters, in order to avoid overfitting.) Unfortunately,
it is hard to know m ahead of time, in general
requiring a sampling scheme like MCMC or a geodesic-following
algorithm 11,24 to ascertain the global structure
of the sloppy subspace.

The problem of parameter estimation has been central 
to the field of systems biology for many years. The
extremely large uncertainties in parameter estimates led to
the suggestion that accurate parameter estimates might
not be possible 12. However, advances in experimental
design have suggested that such estimates might be feasible
after all, although requiring considerable experimental
effort. The perspective provided by sloppy model
analysis provides at least two alternatives to this method
of operation.

First, in spite of the large number of parameters, com¬ 
plex biological systems typically exhibit simple behavior 
that requires only a few parameters to describe, analogous
to how the diffusion equation can describe microscopically
diverse processes. Attempting to accurately
infer all of the parameters in a complex biological model
is analogous to learning all of the mechanical and electrical
properties of water molecules in order to accurately
predict a diffusion constant. It would involve considerable
effort (measuring all the microscopic parameters
accurately), while the diffusion constant can be easily
measured using collective experiments, and used to de¬ 
termine the result of any other collective experiment. 

Second, in many biological systems, there is consider¬ 
able uncertainty about the microscopic structure of the 
system. Sloppiness suggests that an effective model that 
is microscopically inaccurate may still be insightful and 
predictive in spite of getting many specific details wrong. 
For example, a hopping model for thermal conductivity
would be ‘wrong’ even though it gives the right thermal 
diffusion equation. ‘Wrong’ models can provide key in¬ 
sights into the system level behavior because they share 
important features with the true system. In such a sce¬ 
nario, it is the flexibility provided by large uncertainties 
in the parameters that allows the model to be useful. Any 
attempt to infer all the microscopic parameters would 
break the model, preventing it from being able to fit the 
data. 

Indeed, it is difficult to envision a completely micro¬ 
scopic model in systems biology. Any model will have 
rates and binding affinities that will be altered by the 
surrounding complex stew of proteins, ions, lipids, and 
cellular substructures. Is the well-known dependence of 
a reaction rate on salt concentration (described by an ef¬ 
fective Gibbs free energy tracing over the ionic degrees of 
freedom) qualitatively different from the dependence of 
an effective reaction rate on cross-talk, regulatory mech¬ 
anisms, or even parallel or competing pathways not in¬ 
corporated into the model? We are reminded of quan¬ 
tum field theories, where the properties (say) of the elec¬ 
tron known to quantum chemistry are renormalized by 
electron-hole reactions in the surrounding vacuum which 
are ignored and ignorable at low energies. Insofar as 
a model provides both insight and correct predictions 
within its realm of validity, the fact that its parameters 
have effective, renormalized values incorporating missing 
microscopic mechanisms should be expected, not dispar¬ 
aged. 


VII. MORE GENERAL CONSEQUENCES OF SLOPPINESS 

The hyperribbon structures implied by interpolation
theory and information geometry in section [III] have
profound implications. Complex scientific models have
predictions that vary in far fewer ways than their complexity
would indicate. Multiparameter models have behavior
that largely depends upon only a few combinations of
microscopic parameters. The high-dimensional results of a
system with a large number of control parameters will 
be well encompassed by a rather flat, low-dimensional 
manifold of behavior. In this section, we shall speculate 
about these larger issues, and how they may explain the 
success of our efforts to organize and understand our en¬ 
vironment. 

Efficacy of Principal Component Analysis. Principal
component analysis, or PCA, has long been an effective
tool for data analysis. Given a high-dimensional data
set, such as the changes of mRNA levels for thousands
of genes under several experimental conditions, PCA
provides a reduced-dimensional description which often
retains most of the variation in the original data set in a
few linear components. Arranging the data into a matrix
$R_{jn} + c_j$ of experiments $n$ and data points $j$ centered at
$c_j$, PCA uses the singular value decomposition (SVD)

\[
R = \sum_k \sigma_k \, \hat{u}^{(k)} \otimes \hat{v}^{(k)} \tag{15}
\]

\[
R_{jn} = \sum_k \sigma_k \, u^{(k)}_j v^{(k)}_n \tag{16}
\]

to write $R$ as the sum of outer products of orthonormal
vectors $\hat{u}^{(k)}$ in data space and $\hat{v}^{(k)}$ in the space of
experiments. Here $\sigma_1 \geq \sigma_2 \geq \dots \geq 0$ are nonnegative
‘strengths’ of the different components $k$. These singular
values can be viewed as a generalization of eigenvalues for
non-square, non-symmetric matrices. The $\hat{u}^{(k)}$ for small
$k$ describe the ‘long axes’ of the data, viewed as a cloud
of points $R_n$ in data space; $\sigma_k$ is proportional to the RMS
extent of the cloud in direction $\hat{u}^{(k)}$. The utility of PCA stems from
the fact that in many circumstances only a few components
$k$ are needed to provide an accurate reconstruction
of the original data. Just as our sloppy eigenvalues
converge geometrically to zero, the singular values $\sigma_k$ often
rapidly vanish. It is straightforward to show that the
truncated SVD keeping only the first, largest $K$ components
is an optimal approximation to the data, with
total least square error bounded by $\sum_{k>K} \sigma_k^2$. These
largest singular components often have physical or biological
interpretations, sometimes mundane but useful
(which machine was used to take the data), sometimes
scientifically central.
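
The statement about the truncated SVD can be verified directly. The following sketch uses a random matrix as stand-in data (an assumption; any real data matrix would do) and follows the convention of the text, with rows indexing data points $j$ and columns indexing experiments $n$. It checks that the squared reconstruction error of the rank-$K$ truncation equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))            # stand-in: 50 data points j, 8 experiments n
center = data.mean(axis=1, keepdims=True)  # c_j: mean of each data point over experiments
R = data - center

# Thin SVD: columns of U are the u^(k), rows of Vt are the v^(k).
U, sigma, Vt = np.linalg.svd(R, full_matrices=False)

K = 3
R_truncated = (U[:, :K] * sigma[:K]) @ Vt[:K, :]

squared_error = np.sum((R - R_truncated) ** 2)
discarded = np.sum(sigma[K:] ** 2)
print("squared reconstruction error:   ", squared_error)
print("sum of discarded sigma_k squared:", discarded)
```

For random data the singular values decay slowly, so the truncation error is large; the point of the surrounding argument is that data generated by sloppy models behaves very differently.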

Why does Nature often demand so few components to
describe large dimensional data sets? Sloppiness provides
a natural explanation. If the data results from (say) a
biological system whose behavior is described by a sloppy
model $y(\theta)$, and if the different experiments are sampling
different parameter sets $\theta_n$, then the data will be points
$R_{jn} + c_j = y_j(\theta_n)$ on the model manifold. Insofar as the
model manifold has the hyperribbon structure we predict,
it has only a few long axes (corresponding to stiff
directions) and it is extrinsically very flat along these
axes (Fig. 18). Here each $y_j - c_j$, being a difference
between a data point and the center of the data, will be
nearly a linear sum of a small number $K$ of long directions
of the model manifold, and the RMS spread along
the $k$th direction will be bounded by the width of the
model manifold in that direction, plus a small correction
for the curvature. As any cloud of experimental points
must be bounded by the model manifold, the high singular
values will be bounded by the hierarchy of widths of
the hyperribbon. Hence our arguments for the hyperribbon
structure of the model manifold in multiparameter
models provide a fundamental explanation for the success
of PCA for these systems.
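
A small numerical experiment illustrates this argument. The sketch below assumes a hypothetical sloppy model, a sum of four decaying exponentials, with each simulated experiment drawing its decay rates uniformly from an arbitrary range; none of these choices come from the text. The centered data matrix is then analyzed with the SVD exactly as in PCA.

```python
import numpy as np

rng = np.random.default_rng(1)
times = np.linspace(0.0, 3.0, 30)   # data points j are measurement times t_j
n_experiments = 200
# Each experiment n samples its own parameter set theta_n (4 decay rates).
thetas = rng.uniform(0.5, 3.0, size=(n_experiments, 4))

# y_j(theta_n) = sum_i exp(-theta_{n,i} * t_j); rows are data points j,
# columns are experiments n.
Y = np.exp(-times[:, None, None] * thetas[None, :, :]).sum(axis=2)
R = Y - Y.mean(axis=1, keepdims=True)

sigma = np.linalg.svd(R, compute_uv=False)
variance_fraction = np.cumsum(sigma**2) / np.sum(sigma**2)

print("sigma_k / sigma_1:", np.array2string(sigma[:8] / sigma[0], precision=1))
print("variance captured by first 3 components:", variance_fraction[2])
```

Although every experiment varies four parameters and produces thirty numbers, a handful of components reconstructs the data, in line with the bound by the hierarchy of manifold widths.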

Efficacy of Levenberg-Marquardt; improved algorithms.

The Levenberg-Marquardt algorithm 60-62 is one of the
standard algorithms for least squares minimization. Its
broad utility can be explained through the lens of sloppy
models, and geometric insights lead to natural improvements.
Minimizing a linear approximation to a nonlinear
model with a constraint on the step size

\[
\min_{\delta\theta} \, |y(\theta_0) + J\,\delta\theta - y_0|^2 , \qquad |\delta\theta| \leq \Delta \tag{17}
\]

leads to the iterative algorithm

\[
\delta\theta = -(J^T J + \lambda)^{-1} J^T \, (y(\theta_0) - y_0), \tag{18}
\]

where $\lambda$ is a Lagrange multiplier. The FIM ($J^T J$) for a
typical sloppy model is extremely ill-conditioned. However,
the damped scaling matrix $J^T J + \lambda$ will have no
eigenvalues smaller than $\lambda$. By tuning $\lambda$, the algorithm
is able to explicitly control the effects of sloppiness.
Furthermore, since the eigenvalues of $J^T J$ are roughly
log-linear, $\lambda$ need not be finely tuned to be effective. By
slowly decreasing $\lambda$, the algorithm fits the key features of
the data first (i.e., the stiffest directions), followed by
successive refining approximations (i.e., progressively more
sloppy components). The algorithm may still converge
slowly as it navigates the extremely narrow canyons of
the cost surface (see Figure [1]) or fail completely if it
becomes trapped near the boundary of the model manifold.
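
The damped step of Eq. (18) can be sketched in a few lines. The code below is a minimal illustration, not a production implementation and not the improved algorithm discussed next: the two-exponential model, the starting point, and the simple rule of dividing or multiplying $\lambda$ by ten after accepted or rejected steps are all assumptions chosen for brevity.

```python
import numpy as np


def model(theta, t):
    """Hypothetical test model: y(t) = sum_i exp(-theta_i * t)."""
    return np.exp(-np.outer(t, theta)).sum(axis=1)


def jacobian(theta, t):
    """J[j, i] = dy(t_j)/dtheta_i = -t_j * exp(-theta_i * t_j)."""
    return -t[:, None] * np.exp(-np.outer(t, theta))


def levenberg_marquardt(theta0, t, y_obs, lam=1.0, n_iter=100):
    theta = np.array(theta0, dtype=float)
    r = model(theta, t) - y_obs
    cost = 0.5 * r @ r
    history = [cost]
    for _ in range(n_iter):
        J = jacobian(theta, t)
        # Damped step: no eigenvalue of J^T J + lam * I is smaller than lam.
        A = J.T @ J + lam * np.eye(theta.size)
        step = -np.linalg.solve(A, J.T @ r)
        r_new = model(theta + step, t) - y_obs
        cost_new = 0.5 * r_new @ r_new
        if cost_new < cost:
            # Accept and relax the damping, exposing sloppier directions.
            theta, r, cost = theta + step, r_new, cost_new
            lam = max(lam / 10.0, 1e-12)
        else:
            # Reject and increase the damping, shortening the next step.
            lam *= 10.0
        history.append(cost)
    return theta, history


t = np.linspace(0.0, 3.0, 30)
theta_true = np.array([1.0, 3.0])
y_obs = model(theta_true, t)  # noise-free synthetic data
theta_fit, cost_history = levenberg_marquardt([0.5, 2.0], t, y_obs)
print("fitted decay rates:", theta_fit)
print("initial cost:", cost_history[0], "final cost:", cost_history[-1])
```

Because steps are only accepted when the cost decreases, the cost history is non-increasing by construction; how quickly it falls depends on how the damping interacts with the stiff and sloppy eigendirections of $J^T J$.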

Information geometry provides a remarkable new perspective
on the Levenberg-Marquardt algorithm. The
move $\delta\theta$ for $\lambda = 0$ is the direction in parameter space
corresponding to the steepest descent direction in data
space; for $\lambda \neq 0$ the move is the steepest descent direction
on the model graph. The fact that the model
graph is extrinsically rather flat turns the narrow optimization
valleys in parameter space into nearly concentric
hyperspheres in data space, explaining the power
of the method. Levenberg-Marquardt takes steps along
straight lines in parameter space; to take full advantage
of the flatness of the model manifold, it should ideally
move along geodesics. As it happens, the leading term in
the geodesic equation is numerically cheap to calculate,
providing a “geodesic acceleration” correction to the
Levenberg-Marquardt algorithm which greatly improves
the performance and reliability of the algorithm 63,64.


Evolution is enabled. Besides practical consequences 
for parameter estimation of biochemical networks (sec¬ 
tion VI), sloppiness has potential implications for biol¬ 
ogy and evolution. Specifically, the fact that biological 
systems often achieve remarkable robustness to environ¬ 
mental perturbations may be less mysterious when tak¬ 
ing into account the vastness of sloppy subspaces. For in¬ 
stance, the circadian rhythm in cyanobacteria, controlled 
by the dynamics of phosphorylation of three interacting 
Kai proteins, seems remarkable in that it maintains a 24- 
hour cycle over a range of temperature over which kinetic 
rates in the system are expected to double. Yet the de¬ 
gree of sloppiness in the system suggests that evolution 
may have to tune only a few stiff parameter directions to 
get the desired behavior at any given temperature, and 
perhaps only one extra parameter direction to make that 
behavior robust to temperature variation 65 . Extended, 
high-dimensional neutral spaces have been identified as a 
central element underlying robustness and evolvability in 
living systems 66 , and sloppy parameter spaces play a sim¬ 
ilar role: a population with individuals spread through¬ 
out a sloppy subspace can more easily reach a broader 
range of phenotypic changes, such that the population is 
simultaneously highly robust and highly evolvable 65 . 


Pattern recognition as low-dimensional represen¬ 
tation. The pattern recognition methods we use to com¬ 
prehend the world around us are clearly low-dimensional 
representations. Cartoons embody this: we can recog¬ 
nize and appreciate faces, motion, objects, and animals 
depicted with a few pen strokes. In principle, one could 
distinguish different people by patterns of scars, finger¬ 
prints, or retinal patterns, but our brains instead process 
subtle overall features. Caricatures in particular build on 
this low-dimensional representation - exaggerating un¬ 
usual features of the ears or nose of a celebrity makes 
them more recognizable, placing them farther along the 
relevant axes of some model manifold of facial features. 
Archetypal analysis, a branch of machine learning, analyzes
data sets with a matrix factorization similar to
PCA, but expressing data points as convex sums of fea¬ 
tures that are not constrained to be orthogonal. In addi¬ 
tion, the features must be convex combinations of data 
points. Archetypal analysis applied to suitably processed 
facial image data allows faces to be decomposed into 
strikingly meaningful characteristic features 67-69. The
success of such algorithms is clearly related to a hidden 
low-dimensional representation of the data. One may 
speculate that our facial structures are determined by 
the effects of genetic and environmental control parameters
$\theta$, and that the resulting model manifold of faces
has a hyperribbon structure, explaining the success of
the linear, low-dimensional archetypal analysis methods,
and perhaps also the success of our biological pattern
recognition skills.

Big data is reducible. Machine learning methods that 
search for patterns in enormous data sets are a grow¬ 
ing feature of our information economy. These meth¬ 
ods at root discover low-dimensional representations of 
the high-dimensional data set. Some tasks, such as the 
methods used to win the Netflix challenge 70 of predict¬ 
ing what movies users will like, directly make use of this 
low-dimensional representation by using SVD and PCA. 
More complex problems, such as digital image recog¬ 
nition, make use of artificial neural networks, such as 
stacked denoising autoencoders 71 . Consider the problem 
of recognizing handwritten digits (the MNIST database). 
The neural networks can be viewed as a fitting problem,
with parameters $\theta_\alpha$ giving the outputs of the digital
neurons, and the model $y(\theta)$ producing a digital image
that is optimized to best represent the written digits.
The training of these networks focuses on simultaneously
developing a model manifold flexible enough to closely
mimic the data set of digits, and developing a mapping
$y^{-1}(d)$ from the original data $d$ depicting the digit
to neural outputs $\theta = y^{-1}(d)$ close to the best fit. Neural
networks starting with high-dimensional data routinely
distill the data into a much smaller, more comprehensible
set of neural outputs $\theta$, which are then used to classify
or reconstruct the original data. Initial explorations of
a stacked denoising autoencoder trained on the MNIST
digit data by Hayden et al. 72 show a clear hyperribbon
structure. What is surprising is not that a successful
neural network has a hyperribbon structure.
Indeed, if it were not true that the $(N+1)$th thinnest
direction on the model manifold is significantly thinner
than the first $N$ directions, surely an $N$ neuron model
would fail to capture the behavior of the data. What 
does demand explanation is that these methods succeed 
at all - that our handwritten digits live on a hyperribbon, 
allowing the neural networks to succeed. 

Science is possible. In fields like ecology, systems 
biology, and macroeconomics, grossly simplified models 
capture important features of the behavior of incredi¬ 
bly complex interacting systems. If what everyone ate 
for breakfast was crucial in determining the economic 
productivity each day, and breakfast eating habits were 
themselves not comprehensible, macroeconomics would 
be doomed as a subject. We argue that adding more 
complexity to a model produces diminishing returns in 
fidelity, because the model predictions have an underlying 
hyperribbon structure. 

Different models can describe the same behav¬ 
ior. We are told that science works by creating theories, 
and testing rival theories with experiments to determine 
which is wrong. A more nuanced view allows for effective
theories of limited validity: it is not that Newton was wrong
and Einstein right; Newton’s theory is valid when velocities
are slow compared to the speed of light. In more complex
environments, several theoretical descriptions can
cast useful light onto the same phenomena (‘soft’ and
‘hard’ order parameters for magnets and liquid crystals 73
(Ch. 9)). Also, in fields like economics and systems biology,
all descriptions are doomed to neglect pathways or
behavior without the justification of a small parameter.
So long as these models are capable of capturing the ‘long 
axes’ of the model manifold in the data space of known 
behavior, and are successful at predicting the behavior 
in the larger data space of experiments of interest, one 
must view them as successful. Many such models will in 
general exist - certainly reduced models extracted sys¬ 
tematically from a microscopic model (section IV), but 
other models as well. Naturally, one should design exper¬ 
iments that test the limits of these models, and cleanly 
discriminate between rival models. Our information geometry
methods could be useful in the design of experiments
distinguishing rival models: current methods that
linearize about expected behavior 74 could be replaced by
geometric methods that allow for large uncertainties in
model parameters corresponding to nearly indistinguishable
model predictions.


Why is the world comprehensible? Surely the rea¬ 
son that handwritten digits have a hyperribbon struc¬ 
ture - that we don’t use random dot patterns to write 
numbers - is partially related to the way our brain is 
wired. We recognize cartoons easily, therefore the infor¬ 
mation in our handwriting is encapsulated in cartoon-like 
subrepresentations. Surely physics has low-dimensional 
representations (section [V]) independently of the way our
brain works. The continuum limit describes our world 
perturbatively in the inverse length and time scales of 
the observation; the renormalization group in addition 
perturbs in the distance to the critical point. Why is 
science successful in other fields, systems biology and 
macroeconomics, for example? Is it a selection effect 
- do we choose to study subjects where our brains see 
patterns (low-dimensional representations), and then de¬ 
scribe those patterns using theories with hyperribbon 
structures? Or are there deep underpinning structures 
(evolution, game theory) that guide the behavior into 
comprehensible patterns? A cellular control circuit where 
hundreds of parameters all individually control impor¬ 
tant, different aspects of the behavior would be incompre¬ 
hensible without full microscopic information, discourag¬ 
ing us from trying to model it. On the other hand, it 
would seem challenging for such a circuit to arise under 
Darwinian evolution. Perhaps modularity and comprehensibility
themselves are the result of evolution? 75-78


Conclusion. What began as a rather pragmatic exer¬ 
cise in parameter fitting has blossomed into an enterprise 
that stretches across the landscape of science. The work 
described here has both methodological implications for 
the development and validation of scientific models (in 
the areas of optimization, machine learning and model 
reduction) as well as philosophical implications for how 
we reason about the world around us. By investigating 
and characterizing in detail the geometric and topologi¬ 
cal structures underlying scientific models, this work con¬ 
nects bottom-up descriptions of complex processes with 
top-down inferences drawn from data, paving the way for 
emergent theories in physics, biology, and beyond.